# ntxh

This branch contains data and code accompanying several articles/papers on Cognitive Linguistics, including the article "Age-related hearing loss, speech understanding and cognitive technologies" by Joseph Lehmann _et al._, published as of March 2021 in the _International Journal of Speech Technology_ (IJST). Thank you for your interest in the article, dataset, and code library. 


This repository also contains a collection of audio samples used for speech quality measurement, republished from the IJST article _Subjective Speech Quality Measurement With and Without Parallel Task: Laboratory test results comparison_ by Hakob Avetisyan and Jan Holub. These audio samples are included to demonstrate techniques for evaluating speech quality that might be used in the context of assessing technologies which modify speech input in light of cognitive-linguistic principles, as proposed in Joseph Lehmann's article. 

In addition there is republished data reviewed in _Advances in Ubiquitous Computing: Cyber-Physical Systems, Smart Cities and Ecological Monitoring_, Chapter 3. The text for this chapter (as generated by the data-set's own PDF tools) and for the book itself are in the "`documents`" folder. The _Advances in Ubiquitous Computing_ data is an avian data set (see `https://datadryad.org/stash/dataset/doi:10.5061/dryad.j2t92` for the original) demonstrating certain data-management techniques through requirements associated with audio/acoustical processing.

# Introduction

---
**The Primary Data Set and (ideas for) a Cognitive-Linguistic Annotation Framework**

The primary data set, accompanying Joseph Lehmann's publication, comprises 553 natural-language sentences drawn from several sources, chosen to emphasize different topics in cognitive linguistics and prosody. As explained in the article, one goal of the data set is to illustrate cognitive structures which should be represented in a language-annotation system that could be applicable for multiple language-processing and bioinformatic use-cases. 

As one use-case, Lehmann _et al._ analyze the goal of describing acoustical transformations for audio recordings (or real-time audio input) of spoken language which would clarify the audio data for the hearing-impaired, using transforms derived from cognitive-linguistic research (which can, potentially, be more fine-grained than just eliminating background noise, or modulating speech volume to an even level, or other relatively simple audio manipulations). 

Other potential applications for cognitively-inclined language annotation may involve notating speech and intonation patterns via structures which convey their cognitive origins, or simply documenting syntactic or semantic forms (such as dependency grammar) in terms of representations whose theoretical origins lie in cognitive linguistics (such as Construction Grammar).

Currently, although individual cognitive linguists may have distinctive styles of representing or diagramming linguistic patterns (or semantic frames, parse-structures, and so forth), there does not exist a conventional notation for analyses in this tradition comparable to GF (the Grammatical Framework), Link Grammar, FrameNet, or other linguistic perspectives rooted in relatively formal/computational paradigms (Martin-Löf type theory, frame semantics, Combinatory Categorial Grammar, etc.). This data set has been curated with the hope of illustrating some of the technological requirements for standardizing a reasonably broad-based cognitive-linguistic digital ecosystem, which could involve annotations, corpus and lexicon development, parse-representations, prosodic/intonation descriptions, as well as graphics formats that would be unique to cognitive linguistics given its specific themes (such as landmark/trajector or cognitive-grounding diagrams).

Each of the 553 samples is discussed in some detail in the course of several documents included with the data set (some of the samples are also analyzed in prior literature, as referenced within the data fields). The "`documents`" folder includes three essays by Nathaniel Christen which study the samples from different perspectives in Cognitive Grammar. The language samples are printed in the text as example case-studies -- as is common in pragmatics and related linguistic disciplines -- and the core data set is compiled in its aggregate form by pre-parsing the source code from which these documents were generated.

For readers who find it difficult to compile and use the data-set application (for browsing the samples in isolation), or who don't want to take the time to work with any of these programming tools, the samples can be browsed collectively through a simple Markdown file ("`samples.md`") in the `@/documents/markdown` folder (the "`@`" symbol designates the repository root folder). Click [here](./documents/markdown/samples.md) to view this file without cloning the repository, if desired. The complete data-set repository contains many files which demonstrate technological features associated with the idea of a cognitive-grammar description language, in addition to the specific data set comprised of these samples, so readers who do not intend to work with any of the supplemental source code directly can use the "`samples`" file for a quick preview of the data set. 

Readers may also want to browse the essays where the samples are analyzed and from which the samples are extracted, particularly the "CognitiveTransformGrammar.pdf" file in "`@/documents`" (this is the most polished paper; the other two are more provisional, but all three were scraped to compile the data set). In lieu of cloning the repository, readers may wish to start by looking at the essay on "Cognitive Transform Grammar" -- click [here](https://github.com/scignscape/ntxh/blob/ctg/documents/CognitiveTransformGrammar.pdf) -- which discusses roughly half the samples. Note that the analyses (and the curation of the data set as a whole) are those of Nathaniel Christen, a co-author but not primary author on Joseph Lehmann's paper, and don't necessarily reflect the views of the other authors (though I have no reason to suspect that they would substantially disagree with my argumentation). 

Within the data set, some of the samples were previously analyzed (or are similar to ones previously analyzed) by linguists such as Ronald Langacker, Jens Allwood, and James Pustejovsky; some samples with prosodic markup or distinct prosodic features are taken from Langacker, Julia Hirschberg, the Blackwell _Handbook of Pragmatics_, and an article by Laura A. Michaelis and Knud Lambrecht; and other samples are taken from a CoNLL corpus (see below). 


---
**Notation**

Source files in the special "`GTagML`" language (see below) include notation for typesetting, 
metadata, and diagrammatic treatment of language samples (e.g. prosodic markup, parse-graphs, 
citation of corpora sources, and so forth). The authors propose a `TagML` based format for 
annotating language samples and their metadata (for purposes of corpus-building and 
general-purpose linguistics publication). The code accompanying this data set represents 
the _kind_ of coding requirements necessary for establishing a general-purpose sample-annotation 
system for Cognitive Linguistics. Readers may examine the specific syntax used for this 
data set, but the code can be refactored to work with other markup-language grammars as well.

Several language examples use various forms of prosodic/intonation markup, including 
ToBI (Tones and Break Indices) notation. Following the conventions of linguists such as 
Hirschberg and Langacker, other samples employ bold or small-caps to mark full or partial 
stress, respectively. Langacker's double-slash notation to mark potential intonation gaps 
is carried over to his examples present in the data set.
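
As a small illustration of how the double-slash convention could be handled in code, the following C++ sketch splits a sample string into intonation groups at each `//`. This is only a sketch under stated assumptions: the function name `split_intonation_groups` and the example sentence in the usage note are hypothetical, and nothing here is taken from the data set's own `GTagML` tooling.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Remove leading and trailing spaces/tabs from a string.
static std::string trim(const std::string& s) {
    const auto first = s.find_first_not_of(" \t");
    if (first == std::string::npos) return "";
    const auto last = s.find_last_not_of(" \t");
    return s.substr(first, last - first + 1);
}

// Hypothetical helper: split a sample at each "//" marker and return the
// trimmed, non-empty pieces (the candidate intonation groups) in order.
std::vector<std::string> split_intonation_groups(const std::string& sample) {
    std::vector<std::string> groups;
    std::string::size_type start = 0;
    while (true) {
        const auto pos = sample.find("//", start);
        const auto len = (pos == std::string::npos) ? std::string::npos : pos - start;
        const std::string piece = trim(sample.substr(start, len));
        if (!piece.empty()) groups.push_back(piece);
        if (pos == std::string::npos) break;
        start = pos + 2;  // skip past the "//" marker itself
    }
    return groups;
}
```

For instance, calling `split_intonation_groups("If she said it // then it's true")` would return the two groups `If she said it` and `then it's true`.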

In general, a robust "Cognitive Grammar Annotation and Metadata Language" would have 
multiple components, including (1) straightforward sample text; (2) notation of sample 
sources in prior literature/corpora as well as their position in the current publication 
where they are presented (e.g. page and section numbers); (3) notation of parse-graphs 
or other parsing structures and/or semantic/lexical data or interpretations, appropriate 
for the publications' methods and theoretical perspectives; (4) prosodic/intonation markup 
and/or discourse/conversational representations when needed; and (5) identification of issues 
or analytic commentary on samples, such as markup for dubious or unacceptable examples 
introduced to illustrate theories for why certain constructions are ill-formed. 
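
To make the five components listed above more concrete, here is a minimal C++ sketch of a record type that could hold them. Every type and field name (`AnnotatedSample`, `SourceCitation`, `acceptability_mark`, and so on) is a hypothetical choice for illustration; the data set's actual representation is defined by its `GTagML` sources and C++ code, not by this sketch.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical citation of a prior publication or corpus.
struct SourceCitation {
    std::string work;     // title or identifier of the source
    std::string locator;  // e.g. a page or example number within that source
};

// Hypothetical record mirroring components (1) through (5).
struct AnnotatedSample {
    std::string text;                               // (1) plain sample text
    std::vector<SourceCitation> prior_sources;      // (2) sources in prior literature/corpora
    std::string position_in_publication;            // (2) e.g. page and section in the current text
    std::optional<std::string> parse;               // (3) parse or semantic notation, if any
    std::optional<std::string> prosody;             // (4) prosodic/discourse markup, if any
    std::optional<std::string> acceptability_mark;  // (5) e.g. "?" or "*" for dubious/ill-formed
    std::vector<std::string> commentary;            // (5) analytic notes on the sample
};

// True when the sample carries a mark flagging it as dubious or unacceptable.
bool is_flagged(const AnnotatedSample& sample) {
    return sample.acceptability_mark.has_value();
}
```

Optional fields are used for components that only some samples carry, so that a plain sentence with no parse or prosodic markup remains a valid record.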

Each of these sorts of information can be modeled via notation used for this data set, 
though the language for doing so has not (yet) been formally described. For readers 
who would like to examine this notation notwithstanding its provisional/experimental 
status, a list of the "`GTagML`" sources for every sample is [here](https://github.com/scignscape/ntxh/blob/ctg/documents/all-samples.gt). 

Readers interested in speech/prosody may want to search for 
examples of the "`discourse-markup-inline`" command and view 
corresponding sample markup, as well as samples in the same 
section, to see the syntax and commands relevant to prosodic 
markup in particular. Similarly, readers interested in corpus curation 
may wish to search for "`udref`" commands to see the notation 
used to indicate samples obtained from a CoNLL shared task collection.


# Technology Overview
---

__Note:__ For "getting started" information on how to launch a basic version of 
the data-set application, which may help to orient the rest of this 
discussion, see _Setting Up a Development Environment_ later in this section.

The data set includes code for a novel document-preparation format with a LaTeX preprocessor and C++ callback engines programmed to identify the samples and export them to a dataset. This technology also implements a pipeline where sentence boundaries and other textual markers are located in the text-encoding and merged with PDF coordinate data generated when running `pdflatex` with auxiliary files (via LaTeX `write` functions). 

This software uses a hypergraph-based encoding system to construct and then merge the text-encoding data with the PDF coordinate metadata, embedding metadata within the generated PDF file that can then be extracted with the special PDF viewer (part of the data set's code sources). That viewer (based on Glyph and Cog's XPDF; see below) implements some text-encoding-driven features such as copying individual sentences (by mapping text encoding to sentence boundary PDF locations). Nathaniel's _Advances in Ubiquitous Computing_ chapter (the PDF form in this repo, not in the published book) was generated with similar tools.

One goal when curating this data set was to create an (almost) entirely self-contained computing environment for Cognitive Linguistic research, document-preparation, and corpus curation. This repository includes source code for a custom markup language "`GTagML`" (inspired by TagML, the Text-as-Graph Markup Language invented by the Huygens ING digital-humanities research institute in the Netherlands) as well as the custom PDF viewer which can read the metadata generated by our version of TagML. There is also source code for NTXH itself, a hypergraph-based data modeling and serialization format which is used (a little) as part of the metadata pipeline, as well as for a graph database engine which could potentially form the basis of a linguistic corpus system. 

We publish these papers and source code in part to invite other authors/researchers to consider how we could set up similar workflows for other projects. We are especially interested in constructing corpora of samples taken from Cognitive Linguistics literature. We invite linguists who are interested in contributing samples/analyses of their own to add to the current data set or as the basis of a new data set!

---
**Samples**

The primary purpose of this data set is to provide a collection of language samples (mostly sentences) which illustrate various points, claims, and theories in the context of Cognitive Linguistics. The focus is on Cognitive Grammar in particular, including consideration of how discursive constructions signal speaker-relative epistemic orientation and situational understanding, and how these patterns can be manifest in prosody (speech patterns) and communicative interactions in general. The samples are drawn from a variety of sources, including some invented to illustrate or examine claims in the context of cognitive grammar. Other samples are selected from writings of linguists such as Ronald Langacker, Geoffrey Nunberg, and Julia Hirschberg, and from a recent Conference on Computational Natural Language Learning (CoNLL) corpus. Collectively the sentences/samples are annotated with different kinds of parses, prosodic descriptions, or other analytic structures with the goal of exploring cognitive processing and how the cognitive dimensions of understanding can be related to other aspects of language, such as speech and prosody, as well as prelinguistic situational awareness. The data set includes code and special software for viewing the samples, together with prototypes for novel document-generation formats, database management tools, and other programming concerns which could potentially be integrated into a cognitive-linguistics software platform.

The sample annotations use several formats including 

1) S-Expressions to indicate parse structures
2) CoNLL-U (Conference on Computational Natural Language Learning -- "Universal" format) dependency graphs
3) Prosodic Annotations using ToBI (Tones and Break Indices) or other markup to suggest intonation boundaries, stress patterns, and so forth.
4) Proposed variations on dependency graphs using De Bruijn indices, or, in another context, word-pairing sequences to decompose dependency parses into particular head/dependent relations
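
One possible reading of the "word-pairing" idea in the last item is sketched below in C++: given the words of a sentence and, for each word, the 1-based index of its head (with 0 marking the root, as in CoNLL-U), the function lists each head/dependent word pair. This is an illustrative assumption rather than the notation actually used in the essays; the name `head_dependent_pairs` and the toy sentence in the usage note are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: decompose a dependency parse into (head, dependent)
// word pairs. heads[i] is the 1-based position of the head of forms[i];
// a value of 0 marks the root, which has no head and yields no pair.
std::vector<std::pair<std::string, std::string>>
head_dependent_pairs(const std::vector<std::string>& forms,
                     const std::vector<std::size_t>& heads) {
    assert(forms.size() == heads.size());  // one head index per word
    std::vector<std::pair<std::string, std::string>> pairs;
    for (std::size_t i = 0; i < forms.size(); ++i) {
        const std::size_t head = heads[i];
        if (head == 0 || head > forms.size()) continue;  // root or invalid index
        pairs.emplace_back(forms[head - 1], forms[i]);
    }
    return pairs;
}
```

With the toy input words `The`, `dog`, `barked` and head indices `2`, `3`, `0`, the result would be the pairs (`dog`, `The`) and (`barked`, `dog`).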

Most of the samples are also printed inline in one of three accompanying documents discussing various linguistic and philosophical issues relevant to Cognitive Grammar. Dataset users may, if desired, read or skim over those documents to see the samples in the context of arguments discussing/analyzing them, as well as browsing the samples as a collection via the software applications whose code is provided with the data set. If built with PDF support, users can switch between the dataset application and the PDF documents by right-clicking on any of the samples; one context menu option provides a feature for launching a PDF viewer and reading the sample in the document context. 

This data set and code library are published with the hope of helping to seed or inspire a broader software project in cognitive linguistics and particularly cognitive grammar, which could include features such as supporting different parse and analytic representations of language artifacts (e.g. sentences) annotated from a cognitive-grammar perspective, generating and visualizing parse-graphs and diagrams illustrating cognitive-grammar principles, integrating cognitive grammar with prosody and other linguistic concerns as well as cognitive science more generally, and so forth. The goal of collecting samples from existing linguistics literature -- such as those in the current data set from the _Blackwell Handbook of Pragmatics_ -- could also easily be expanded to other sources, creating a larger corpus of samples which have been used in the literature to illustrate cognitive-linguistic claims.

---
**Data-Set Application**

While some of the materials included in this repository can be viewed with ordinary PDF applications or generic text readers, conveniently accessing the central cognitive-linguistic data set requires compiling the accompanying code. This code produces a self-contained "dataset application" which allows users to scroll through the samples, and will connect samples to a custom PDF viewer and PDF documents where the samples are discussed/analyzed. To build the application and the PDF viewer, it is necessary to install Qt, a C++ GUI library, which is freely available. The easiest way to use the data set is to run the dataset application from within the Qt Creator IDE.

The code base currently uses Qt5, but a newer Qt major version, Qt6, was released at the end of 2020 with some significant changes. The current repository should therefore be considered "work in progress" with one outstanding task being to port all the C++ code to work with Qt6.

This repository demonstrates a fairly extensive range of techniques and features related to dataset management and matching datasets with custom-built applications. There are demo-related components concerning text encoding, database engineering, data serialization, dataset organization, and GUI implementation, including testing and documentation. At this stage, many of these components are more conducive to in-person demonstration than out-of-the-box usability, and the documentation for most of the repository code is incomplete. For these reasons, users are invited to ask questions about the different components that may be found here by browsing through the code base. Hopefully future versions of this repository will include more detailed assistance for users seeking to try out the text-encoding, dataset curation, PDF generation, and related tools for their own C++ projects.

Note finally that the data-set application is implemented as a 
"dialog box", rather than a conventional "main window" (which in a 
typical desktop application would have a file menu and toolbar). 
Although dialog boxes (technically, instances of the `QDialog` C++ 
class) are not usually designed to serve as the primary window 
of an application, they can be programmed to function like a 
main window (i.e., in the Qt context, instances of the 
`QMainWindow` class) if desired. One rationale for adopting this 
design is to implement components which can be used as 
standalone applications (such as data-set applications) and 
could also be reused as embedded components in other software. 
For instance, this data-set application or future components with 
a similar design could be embedded in software used for 
curating linguistic and/or speech corpora or academic 
texts, particularly those based on Qt, such as Praat, 
Praaline, BiblioteQ, LitSoz, or TeXstudio.

---
**Console Builds and Utility Scripts**

Some technical information about building/compiling the data set source code 
is below. Some of this information may be relevant to users for purposes of 
explaining how the data set files are organized (and in some cases generated) 
so these details are provided here.

The easiest way to use the data-set application is via Qt Creator, an 
Integrated Development Environment (IDE) especially designed for 
C++ projects which use the Qt application-development framework. 
Within Qt Creator, the data-set application (and supplemental 
testing/development features, if desired) can typically be built and run 
with only one or two steps. Full discussion of Qt Creator is outside 
the scope of this overview, but it is easy to find documentation 
for Qt Creator (which is free for non-commercial use) explaining 
how to build and run C++ projects. In the data set, all Qt "project" files are 
located under the "`@/code/cpp/projects`" directory (again, the "`@`" symbol designates the repository root folder).

In some circumstances a user may prefer to build some of the data-set libraries 
or executables from a command line instead of within Qt Creator. Each 
of the data-set projects is designed to identify during compilation (and 
if necessary at runtime) whether it is being built in this 
"console" mode and alter its behavior accordingly. The `"@/code/cpp/qmake-console"` 
directory includes a "`projects`" folder which contains (within subfolders) 
command-line scripts that run "`qmake`" and can run "`make`" (thereby 
replicating the build steps which Qt Creator takes) on many of the project 
files that users would otherwise access via the IDE. Scripts (called "`run`" 
or some variation thereof) within the project-specific folders allow the 
projects to be run via the command line when their code compiles to an 
executable (rather than a library).

Casual users would presumably have little reason to build and run projects 
in this manner. However, the project-specific run scripts have the 
side-benefit of allowing certain steps in preparing the data set to be 
executed automatically. In particular, the `"@/code/cpp/qmake-console/util"` 
folder includes several scripts which are executed prior to revisions of the data 
set being uploaded. These scripts automate numerous tasks 
related to preparing the data set and its associated documents, including 
generating "`LaTeX`" sources, running the system "`pdflatex`" command, 
merging LaTeX-derived auxiliary files with data structures generated 
by `GTagML`, aggregating the data set into different formats for 
publishing in different locations, and copying all the data-set related 
files to proper locations in the project file system.

The data set is structured so that the editing and preparing of documents is 
conducted in folders which do not become part of the data set itself. 
By default, the publications (or supplementary/explanatory texts) which 
are intended to be part of or associated with the data set are assumed to be 
located within a "`@/dev/documents`" directory which is a _sibling_ folder 
to the repository root. A utility script creates a zipped version 
of the `GTagML` files most recently used to create relevant documents 
and places the resulting zip file in the repository's "`@/dev/documents/gen`" 
folder, but it is assumed that those files will be unzipped to their 
expected out-of-source location for any user who wishes to recreate 
the steps taken to construct the data set to begin with (a utility 
script is included which does this automatically).

Note that the data-curation tools within the data set assume that the 
document sources from which the data set is extracted have a specific 
file organization (some of this structure is set up by 
default by `GTagML`-related executables or by utility scripts). At 
present this repository makes no effort to thoroughly document 
the necessary folder structure or proper usage of the `GTagML` scripts 
and projects. Users wishing to examine these tools in greater detail 
are invited to request more information. Note that these details are 
only relevant to those who wish to replicate or analyze how this
data set is generated, as opposed to simply viewing the language 
samples. 

Every effort has been made to include tools for aggregating the data set 
itself within this code repository -- partly to serve as a prototype 
of a system for curating other linguistics (and especially 
Cognitive Grammar) data sets in the future, and partly in 
the spirit of transparency (insofar as those who publish 
research data should attempt to clarify the source, origins, 
provenance, and methodology applicable to the data's origination).

As a shortcut for moving development to a different machine, the 
script "`@/code/cpp/qmake-console/util/quick-setup.sh`" tries 
to automatically run other utilities and copy folders 
for preparing a `GTagML` workflow. Again, these steps are 
not relevant to typical users who do not need to 
reconstruct how the data set itself is generated. However, 
examining this script and those which it calls can serve 
as partial documentation of the data-set-generation workflow. 


---
**Setting Up a Development Environment**

Once you have cloned the data set (see below) and wish to view the data set through 
the data-set specific application (rather than via Dryad or accompanying papers) 
it is necessary to compile and build the data-set application. The easiest way to 
do this is via Qt Creator. This section assumes that each user has Qt (the libraries) 
and Qt Creator (the IDE) installed on their computer and has no difficulty compiling, 
building, and running C++ projects within this IDE.

To get started with the data set, begin with the project called "`build-first`". 
Specifically, within Qt Creator open the project at the location 
"`@/code/cpp/projects/qt/qtm/unibuild/dsmain/build-first.pro`". 
(For convenience a shortcut link to the "`unibuild`" folder is provided 
in the root folder of the archive; most of the useful project files 
are in the "`dsmain`" folder inside that "`unibuild`" link target). 
This "`build-first`" project should 
build and run without further preparation (except for potential 
font-related issues; see below), assuming the development environment 
is set up for Qt in general. This build option provides a basic version of 
the data-set application and can serve as a check that the development environment 
is working correctly.

To examine more advanced features of the data set, it is necessary to set up a 
folder called "`preferred`", which must be a _sibling_ folder to the archive 
root. Inside that folder would be several files indicating 
"preferred" paths for certain tools needed for the data set (except for 
basic usage), such as `qmake` and `pdflatex`, as well as several "`.pri`" 
files (which are similar to Qt `pro` project files but can be included in other 
"`.pri`" or "`.pro`" files) providing users opportunities to customize certain 
dependencies and out-of-source locations. Users can examine the 
`@/preferred` folder for a template of the proper out-of-source 
`preferred` folder (note that `sysroot`-related files are not needed 
by most users). For convenience, the `@/preferred` folder contains a zipped 
`preferred-zip` file. Extract it (within the convenience "`@/temp`" 
folder, if desired), then copy the extracted `preferred` subfolder to the 
parent directory of the archive root, so that `preferred` becomes the root's sibling. 
(If desired, check the `quick-setup` script mentioned earlier to see how 
this step may be automated.)
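
The "sibling folder" requirement can be expressed as a short path computation. The C++17 sketch below derives where the out-of-source `preferred` folder is expected to sit, given the location of the archive root. The function name `preferred_folder_for` and the example paths in the usage note are hypothetical; this is not the logic of the `quick-setup` script, only an illustration of the layout described above.

```cpp
#include <cassert>
#include <filesystem>

namespace fs = std::filesystem;

// Hypothetical helper: return the expected location of the out-of-source
// "preferred" folder, i.e. a folder named "preferred" inside the parent
// directory of the archive root (making it the root's sibling).
fs::path preferred_folder_for(fs::path repo_root) {
    // A trailing separator (e.g. "/home/user/ntxh/") leaves the final path
    // element empty, so strip it before asking for the parent directory.
    if (!repo_root.has_filename())
        repo_root = repo_root.parent_path();
    return repo_root.parent_path() / "preferred";
}
```

For an archive cloned to `/home/user/ntxh`, this yields `/home/user/preferred`, with or without a trailing slash on the input.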

With the `preferred` folder in place, the simplest build option to 
try next is `build-quick`, which is located in the same folder as 
`build-first`. It should be possible to build and run `build-quick` 
at this point providing essentially the same data-set application 
as before (via `build-first`) but with additional built-in features.

If the `preferred` setup is different from what the data-set code expects, 
the `build-quick` option will likely fail to link, reporting an 
error about missing font-related libraries (i.e., "`undefined reference`"s).

More complex build options and use-cases are discussed below.

---
**Configuring which Features are Built**

Build options such as `build-first` and `build-quick` should suffice for most 
users, but the data set includes features to fine-tune which capabilities 
are included when the data-set application is compiled. If desired, users 
can choose to incorporate features for testing the application via 
GUI-guided test actions, to construct a customized build configuration, 
and to explore various other interactive features which lie outside 
the basic operations needed to access the data set.

The data set does not use __cmake__ or other conventional build systems 
other than __qmake__. To provide customizable configuration the 
data-set code relies on strategically placed Qt project (and project-include 
"`.pri`") files. This experimental setup has the advantage of freeing the 
code base from relying on any external build system (apart from Qt's, 
which is not really "external" insofar as the data set is built 
on a Qt foundation) but has the disadvantage of adding additional source 
components and requirements to be maintained. The full scope of this 
configuration model is not directly relevant for most uses of the 
data set itself, but is included as an example of the sort of 
development requirements which should be addressed by a potential 
"Cognitive Grammar" software ecosystem, and potentially reused as 
part of such an ecosystem.

The "`config-dialog`" project provides a GUI component (which can, 
depending on the build model, be integrated with the main 
data-set application) allowing custom configuration to be 
performed via GUI actions rather than with build scripts. 
Again, this code is experimental and intended as a prototype 
as much as for practical use within the current data set. 
However, it may be helpful for users who wish to experiment 
with including or excluding different data-set features. 
Note that (without modifying the configurations for 
pre-defined build models) this dialog is only compiled when 
using the "`build-most`" or "`build-all`" options. 

---
**UDPipe**

The data set's source code and project files include code related to UDPipe, a tool for reading files in the Universal Dependencies (CoNLL-U) format used in shared tasks of the Conference on Computational Natural Language Learning (CoNLL). At this point, the primary purpose of including UDPipe is for reading the sample data and parse representations for those samples derived from a CoNLL-U corpus and included in the data set. As discussed in the essays where the samples are analyzed, a trained CoNLL Dependency Parser was run against some samples so as to demonstrate Universal Dependencies as one grammatical paradigm and methodology, but the current data set and code do not prioritize automated sentence parsing; instead, the UDPipe code is included primarily so as to access CoNLL-U files.
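
For readers unfamiliar with CoNLL-U, each token line holds ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), with `_` standing for an unspecified value, `#` starting a comment line, and a blank line ending a sentence. The C++ sketch below reads one such line using only the standard library. It is independent of UDPipe and does not reproduce UDPipe's API; the names `ConlluToken` and `parse_conllu_line` are hypothetical.

```cpp
#include <cassert>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// The ten CoNLL-U columns, kept as strings. ID and HEAD are not converted
// to integers because ID may be a range ("1-2") or an empty node ("1.1").
struct ConlluToken {
    std::string id, form, lemma, upos, xpos, feats, head, deprel, deps, misc;
};

// Hypothetical helper: parse one line of a CoNLL-U file. Returns nothing
// for comment lines, blank lines, and lines without exactly ten fields.
std::optional<ConlluToken> parse_conllu_line(const std::string& line) {
    if (line.empty() || line[0] == '#') return std::nullopt;
    std::vector<std::string> fields;
    std::istringstream in(line);
    std::string field;
    while (std::getline(in, field, '\t')) fields.push_back(field);
    if (fields.size() != 10) return std::nullopt;
    return ConlluToken{fields[0], fields[1], fields[2], fields[3], fields[4],
                       fields[5], fields[6], fields[7], fields[8], fields[9]};
}
```

A caller could loop over the lines of a file, collect the returned tokens, and treat each blank line as the boundary between two sentences.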

With that said, the data set does include a number of features that may be helpful when working with Dependency Grammars or related methodologies. In addition to code for interoperating with the CoNLL-U format, the essays (via LaTeX sources) demonstrate how to embed Dependency parses in PDF files utilizing a package called "`Tikz-dependency`" (authored by Daniele Pighin of Google Research). In addition, the data-set application includes a GUI component for building parse-representations based on either conventional Dependency Grammar or on Link Grammar (see the "`link pair dialog`" options available through context menus in the central data-list area of the data-set application). This "link pair" dialog is experimental but could be developed into a manual tool for describing parse structures (either through link/dependency graphs or via S-Expressions). I used the dialog to construct the S-Expression notations for samples when these were included in the essays, since the dialog has features such as confirming that parentheses are properly balanced.
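
The parenthesis check mentioned above is simple to state in code. The following C++ sketch shows one common way to do it, by tracking nesting depth while scanning the string. It is not the dialog's actual implementation, and the function name `parentheses_balanced` is hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: return true when every "(" in an S-Expression has a
// matching ")" and no ")" appears before the "(" it would close.
bool parentheses_balanced(const std::string& sexpr) {
    int depth = 0;  // number of currently open parentheses
    for (const char c : sexpr) {
        if (c == '(') {
            ++depth;
        } else if (c == ')') {
            if (--depth < 0) return false;  // closing paren with nothing open
        }
    }
    return depth == 0;  // anything still open means the input is unbalanced
}
```

For example, `(S (NP the dog) (VP barked))` passes the check, while `(S (NP the dog)` and `)(` do not.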

It remains an open question whether frameworks such as Dependency/Link grammar resonate more rigorously with Cognitive-Linguistic perspectives, or instead whether paradigms that tend to analyze language in a more holistic top-down manner, such as Construction Grammar, are the most propitious formal tools for codifying Cognitive Linguistics and Cognitive Grammar (assuming the formal codification of these theories is even plausible). The literature of Cognitive Linguistics includes a number of interesting attempts to establish formal systems encapsulating cognitive theories, representing perspectives such as Conceptual Space Theory, Conceptual Role Semantics, and Construction Grammar. These are the sorts of formal systems which would be legitimate candidates for instantiation in digital tooling (analogous to GF, for example) so as to implement a robust "cognitive-linguistic computational ecosystem."

Quite possibly such an ecosystem would combine bottom-up theories (such as Link Grammar and Conceptual Space Semantics) and top-down theories (such as Construction Grammar and various Discourse-Representation architectures) and therefore seek to synthesize formal codifications associated with both approaches. 

---
## Some Technical Information

The following details may provide helpful tips for readers who intend to download or clone this repository and compile/run the data-set application so as to view the data set and analyses interactively, as opposed to browsing this material informally on GitHub.

---
**Building**

There are several "build strategies" that can be used to create the dataset application and other libraries.  The quickest option is to use the "`build-quick.pro`" project file in the "`@/code/cpp/projects/qt/qtm/unibuild/dsmain`" folder.  For most users the best option may be "`build-most.pro`" in the same folder (both versions can be built independently).  The "`quick`" version lacks PDF and TCP features (which is explained via a message box when trying to use these features).

A more complex option is "`build-all.pro`", which is only needed for users wishing to generate test scripts or use other advanced features related to "Runtime Reflection".  For developing new code or debugging the executables it may be necessary or easier to use the "`isobuild`" strategy, where each Qt project is built separately, rather than the "`unibuild`" options where projects are built automatically in order.  Via "`isobuild`" developers can choose more precisely which projects to include.  For the equivalent of "`build-all.pro`" follow the build order listed in "`build-order.txt`".  Note that these comments are only applicable to a small set of users extending or exploring the code in detail.

In general, examining or reusing the dataset code will be easiest for users who 
have some experience with C++ programming and are familiar with the steps 
needed to install C++ dependencies and compile C++ libraries (though the 
dataset has been designed to minimize external dependencies as much as possible).
For a brief review of the kinds of steps which might be needed to set up a 
development environment suitable for this dataset (and typical C++ projects), 
see the "`@/example-build-setup.txt`" file. Of course, most C++ programmers 
will already have an environment suitable for common C++ projects on 
their computer (the dataset only needs very commonplace libraries 
installed, such as __libgl__ or __libz__, and does not require 
a particularly recent compiler).

---
**Downloading**

If you use "`git clone`", it is recommended to choose your own name for the folder where the data set is unpacked.  For example, create a folder called "`NTXH-CognitiveGrammar`" or something shorter ("`ntxh`", say) and _inside_ that folder execute "`git clone -b ctg --single-branch https://github.com/scignscape/ntxh.git ar`" -- notice the trailing "`ar`" (for "archive"), which names the folder where this repository will be unpacked.  Then the parent "`ntxh`" folder can be used for other files related to the project (or follow-up research) but isolated from the actual repo.
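The same steps can be wrapped in a small shell function. This is only a sketch: `clone_ntxh` and the `GIT` override are hypothetical names introduced here for illustration, while the branch, URL and trailing "`ar`" are taken from the command quoted above.

```sh
# Sketch: create a short parent folder and clone the "ctg" branch into "<parent>/ar".
# clone_ntxh and the GIT variable are illustrative, not part of the repository.
clone_ntxh() {
  parent=${1:-ntxh}                 # parent folder; defaults to "ntxh"
  mkdir -p "$parent" || return 1
  # The subshell keeps the caller's working directory unchanged.
  ( cd "$parent" &&
    ${GIT:-git} clone -b ctg --single-branch https://github.com/scignscape/ntxh.git ar )
}

# Example (requires network access):
#   clone_ntxh ntxh
```

Setting `GIT=echo` prints the clone arguments instead of running them, which can help when checking the target path first.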

Be aware that using long folder names, rather than succinct names like "`ntxh/ar`", may occasionally cause problems (see "`TROUBLESHOOTING`").

---
**Qt**

The Qt libraries and Qt Creator IDE (Integrated Development Environment) are the only known dependencies for most of the project (the exception being specialized use-cases needing Embeddable Common Lisp).  Qt is easy to install from https://www.qt.io/ and is free for noncommercial use. 

Linux users are advised to download and link against a version of Qt different from the Qt libraries bundled with their Desktop Environment.  This is first because Qt versions for Linux desktops are often out-of-date relative to the versions needed by applications, and second because developers should minimize the risk of unintentionally altering the components needed for their Desktop Environment to run.  In effect, the "`qmake`" you use for this repository (and other Qt applications not built into your desktop) should live somewhere other than a "`/usr`" subfolder. 
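As a quick sanity check on Linux, the location of the "`qmake`" found on your `PATH` can be inspected from a terminal. The following is a minimal sketch; `is_system_qmake` is a hypothetical helper name, and the only rule it encodes is the "`/usr`" guideline from the paragraph above.

```sh
# Returns 0 when the given qmake path lives under /usr
# (which suggests the Qt bundled with the Desktop Environment).
is_system_qmake() {
  case "$1" in
    /usr/*) return 0 ;;
    *)      return 1 ;;
  esac
}

qmake_path=$(command -v qmake || true)
if [ -z "$qmake_path" ]; then
  echo "qmake not found on PATH"
elif is_system_qmake "$qmake_path"; then
  echo "warning: $qmake_path is under /usr; consider a separate Qt install"
else
  echo "using $qmake_path"
fi
```

Note that Qt Creator uses the "`qmake`" belonging to the selected kit, which may differ from the one on your `PATH`, so the kit settings are worth checking as well.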

For Windows users, installing Qt is a good way to ensure that you have a C++ development environment, with components like MinGW and g++.  It is recommended to use these tools -- which are based on cross-platform environments -- as opposed to using Microsoft-specific products like the VC++ compiler.

The code in this data set is designed to build and run within Qt Creator.  You do not need a separate build tool like cmake.  Upon loading your preferred project file (e.g., "`build-quick.pro`" or "`build-most.pro`"), you can automatically compile and run the code -- and start exploring the data set -- with Qt Creator's "`Run`" option.  You can use "`build-quick.pro`" to test that your compiler is working properly and quickly browse the data set and then load "`build-most.pro`".  (Note that if you choose "`build-all.pro`" the compile time is noticeably longer -- that's usually not a sign of any problem but see "`TROUBLESHOOTING`".) 

---
**XPDF**

This data set includes a slightly modified version of the open-source XPDF reader (see https://www.xpdfreader.com/).  Users are encouraged to also install the official XPDF Reader.  The version distributed in this repository has been modified to work primarily with this data set.

Users may want to edit the "`@/code/cpp/src/xk/external/xpdf/xpdf/aconf/aconf.h`" file.  If you have (or choose to install) an official XPDF Reader you may want to copy "`aconf.h`" from that code base (there, "`aconf.h`" is generated by cmake; this data set provides a default "`aconf.h`" file to eliminate the need for cmake as a build tool for the data set overall).

---
**TROUBLESHOOTING**

1.  Except with "`build-first.pro`", the build process will generate multiple executable files, some for testing or related documentation.  Most of the time users will want the executable called "`dsmain-console`" (the project named "`__run_dsmain-console`") which should run automatically.  However, any executable may be chosen by right-clicking on the project as listed on Qt Creator's Project Panel (usually on the left of the IDE) or by selecting the desired "Run Configuration" from the "Run Settings" section of the "Projects" tab (at the far-left of the IDE).  If the application does not seem to run properly, it may be because the wrong executable is selected for the default Run Configuration, in which case the "`__run_dsmain-console`" option should be chosen from the drop-down list in "Run Settings".

2.  The data-set project organization uses Qt naming conventions to automatically configure an environment so typical users can easily build and launch the main ("`dsmain-console`") application and other executables.  This process will fail if a working Qt environment (called a Qt "`kit`") is not available _before_ the "`build-quick.pro`" or other project files are opened in the IDE.  It is recommended to double-check that you have a valid kit ("`Options`" -- "`Build & Run`" -- "`Kits`") and start a new, blank session (via the menubar "`File`" -- "`Sessions`" submenu) before starting to use this data set.

3.  Because of a quirk, Qt Creator will on some systems misidentify project files with unusually long paths, causing an endless loop during the build (because "`qmake`" will run repeatedly).  Such behavior has been observed on Windows (but not Linux) and appears to be a bug in older versions of Qt Creator, so users of more recent versions may never encounter this problem, but we'll keep mentioning it in case it arises again.  The problem is most likely to arise, if at all, for users choosing the more complex "`build-all`" or "`isobuild`" strategies.  If you use these options, keep an eye on the "Compiler Output" window in Qt Creator and make sure "`qmake`" is not running multiple times on one file (this attention ceases to be necessary once the Compiler Output suggests that all of the "`qmake`" files are processed and the compiler has started to generate object files).  If you do encounter a loop, you may either rename the problematic "`.pro`" files -- this repo chooses to give projects relatively long, descriptive names -- or choose shorter names like "`ntxh/ar`" for the repo folder and its parent.  Alternatively, employ the "`isobuild`" approach where you can manually decide when to run "`qmake`".

4.  For XPDF, you may encounter linking problems associated with **freetype** fonts and **png**.  Qt provides its own freetypelib and pnglib which can be used as an alternative to system libs when those are not present.  If, however, you have them on your system, simply uncomment the line "`LIBS += -lfreetype  -lpng`" in the XPDF-console project file to resolve linker errors for those libs.
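
For the long-path issue described in item 3, one way to spot candidate files in advance is to list "`.pro`" files whose full path exceeds some length. This is a sketch under stated assumptions: `long_pro_paths` is a hypothetical helper, the default limit of 100 characters is an arbitrary illustration (the length at which Qt Creator misbehaves is not documented here), and on Windows it would need a POSIX shell such as Git Bash.

```sh
# Lists .pro files under a root folder whose path is longer than a limit.
# Output format: "<length> <path>", one file per line.
# The default limit (100) is an arbitrary example, not a known Qt Creator threshold.
long_pro_paths() {
  root=${1:-.}
  limit=${2:-100}
  find "$root" -type f -name '*.pro' | while IFS= read -r f; do
    if [ "${#f}" -gt "$limit" ]; then
      printf '%s %s\n' "${#f}" "$f"
    fi
  done
}

# Example: pass an absolute root so that full path lengths are measured.
#   long_pro_paths "$PWD" 100
```

Any files reported are candidates for renaming, or a hint that a shorter repo folder such as "`ntxh/ar`" may be worthwhile.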

---
**R/Z**

The R/Z project can be used for more advanced Runtime Reflection for datasets.  As it is still experimental, R/Z is not documented thoroughly.  Please inquire for more information.

R/Z provides a demo of certain concepts involved in engineering Virtual Machines and database query engines, discussed particularly in the _Advances in Ubiquitous Computing_ chapter (by Nathaniel Christen). R/Z is far from a working scripting language, but here it is used as part of a testing framework demonstrated via the main dataset application.

The simplest use-case for R/Z is to build Intermediate Representation files to pass to the main ("`dsmain-console`") application.  In this case the application does not need to embed an R/Z scripting environment in the application itself, just a simpler capability to process Intermediate Representation in the format generated by R/Z (though this representation is not specific to R/Z and could potentially be part of a query-evaluation pipeline, with a front end query language).  A more advanced use-case is to process scripts within a main application directly, which is outside the scope of this data set.

To check the R/Z environment overall, test the R/Z compiler with "`rz-graph-dynamo-runtime-console`" or one of the other "rz-dynamo" executables.  The "`t1.rz`" file can have sample code like: 

```
,fnd ::: PHR_Fn_Doc*;
fnd \== default; 

,penv ::: PHR_Env*;
penv \= (envv "PHR_Env*");

fnd -> init penv;

,test-fn ::: .(u4)  $-> extern;

fnd -> read  "test-fn";

```

Multiple samples can be executed in sequence with the project/executable called "`rz-multi-console`".

So for example, after confirming that "`t1.rz`" runs properly, try "`rz-multi-console`" which (as coded) will run all scripts listed under "`@/dev/scripts/rz/demo/multi`", where the "`m1.txt`" file lists scripts to run in sequence (e.g., as an informal test suite). 

Testing multiple scripts can also be achieved by including them all in one script with "`<#...>`" notation, e.g. script "`t24.rz`" in "`@/dev/scripts/rz/demo/multi`" has the two lines "`<#t23>`" and "`<#t25>`".
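
To see which scripts a composite script such as "`t24.rz`" pulls in, the include lines can be listed from a terminal. The sketch below assumes only that each include sits on its own line in the "`<#name>`" form shown above; `list_includes` is a hypothetical helper, not a tool shipped with R/Z, and it does not attempt to reproduce how R/Z itself resolves includes.

```sh
# Prints the names referenced by "<#name>" lines in an R/Z script, one per line.
# Assumes each include occupies a whole line (surrounding whitespace is allowed).
list_includes() {
  sed -n 's/^[[:space:]]*<#\([^>]*\)>[[:space:]]*$/\1/p' "$1"
}

# Example (path relative to the repository root is an assumption):
#   list_includes dev/scripts/rz/demo/multi/t24.rz
```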

---
**COMMENTS**

R/Z and other advanced features will continue to evolve.  This repository will be updated accordingly.  Please check back!

